Abstract

We give a new framework for proving the existence of low-degree, polynomial approxi- 
mators for Boolean functions with respect to broad classes of non-product distributions. Our 
proofs use techniques related to the classical moment problem and deviate significantly from 
known Fourier-based methods, which require the underlying distribution to have some prod- 
uct structure. 

Our main application is the first polynomial-time algorithm for agnostically learning any 
function of a constant number of halfspaces with respect to any log-concave distribution (for 
any constant accuracy parameter). This result was not known even for the case of learning 
the intersection of two halfspaces without noise. Additionally, we show that in the smoothed- 
analysis setting, the above results hold with respect to distributions that have sub-exponential 
tails, a property satisfied by many natural and well-studied distributions in machine learning. 

Given that our algorithms can be implemented using Support Vector Machines (SVMs) 
with a polynomial kernel, these results give a rigorous theoretical explanation as to why many 
kernel methods work so well in practice. 
1 Introduction: Beyond Worst-Case Learning Models



Learning halfspaces is one of the core algorithmic tasks in machine learning and can be solved 
in the noiseless (PAC) model via efficient algorithms for linear programming. The two simplest 
generalizations of this problem namely 1) learning the intersection of two halfspaces and 2) learn- 
ing a noisy halfspace (i.e., agnostic learning) have attracted the attention of many researchers 
in theoretical computer science and statistics. Surprisingly, they both remain challenging open 
problems. 

In the context of computational complexity, there are many hardness results for learning 
halfspace-related concept classes with respect to arbitrary distributions, and the literature is too 
vast for us to survey here. A brief summary might be that strong NP-hardness results are known
for proper learning, where the learner must output a hypothesis that is of the same form (or close 
to the same form) as the unknown concept class [FGRW, GR, DOSW, KS1], and that there are 
cryptographic hardness results even for improper learning, where the learner is allowed to output
any polynomial-time computable hypothesis [FGKP, KS2]. These hardness results apply to many 
easy-to-state problems, including the two simple generalizations of learning halfspaces enumer- 
ated above. 

There is a disconnect, however, between the many discouraging hardness results for learning 
convex sets and the success in practice of popular machine-learning tools for solving just these 
types of problems (e.g., Support Vector Machines). A reasonable question might be "Why do kernel
methods, algorithms that at their core learn a noisy halfspace, work so well in practice?" The
allusion here to Spielman and Teng's work on Smoothed Analysis [ST] is on purpose: supervised 
learning seems perfectly suited to an average-case analysis in terms of the underlying distribution 
on examples. 

Indeed, the main positive result of this paper is a smoothed analysis of learning functions of 
halfspaces: we show that, in the smoothed-analysis setting, functions of halfspaces are agnosti- 
cally learnable with respect to any distribution that obeys a subexponential tail bound (so-called 
subexponential densities) for any constant error parameter. These distributions include all log- 
concave distributions and need not be product or unimodal. Previous work (that we detail in the 
next section) required the underlying distribution to be Gaussian or uniform over the hypercube. 

We leave open the possibility that functions of halfspaces are agnostically learnable with re- 
spect to all distributions in the smoothed-analysis model (i.e., all distributions that have been 
subject to a small Gaussian perturbation). We certainly are not aware of any, say, cryptographic
hardness results for this setting. 

1.1 Introduction: Previous Work on Distribution-Specific Learning 

Many researchers have studied the complexity of learning convex sets with respect to fixed marginal 
distributions. Along these lines, Blum and Kannan [BK1] gave the first polynomial-time algorithm 
for learning intersections of $m = O(1)$ halfspaces with respect to Gaussian distributions on $\mathbb{R}^n$.
Their algorithm runs in time n ^""-* (for any constant accuracy parameter). Vempala [Vem2] 
improved on this work and gave a randomized algorithm for learning intersections of centered 
halfspaces with respect to any log-concave distribution on W 1 in time roughly (n/e) ^ ("cen- 
tered" means that each bounding hyperplane passes through the mean of the distribution). In a 
beautiful follow-up paper, Vempala [Vem1] used PCA to give an algorithm for learning the intersection
of $m$ halfspaces with respect to any Gaussian distribution in time $\mathrm{poly}(n) \cdot (m/\epsilon)^{O(m)}$.
We note that these results hold in the PAC model, and it is not clear if they succeed in the agnostic 
setting. 

In the agnostic model, we are only aware of results that use the polynomial regression algorithm
of Kalai et al. [KKMS]. Klivans et al. [KOS1] (combined with the observations in Kalai
et al.) gave an algorithm for learning any function of $m$ halfspaces in time $n^{O(m^2/\epsilon^4)}$ with respect
to the uniform distribution on $\{-1,1\}^n$. Applying results on Gaussian surface area, Klivans
et al. [KOS2] gave an algorithm for agnostically learning intersections of $m$ halfspaces in time
$n^{\mathrm{polylog}(m)/\epsilon^{O(1)}}$ with respect to any Gaussian distribution.

A major goal in this area has been to move beyond Gaussians and tackle the case when the 
underlying distribution is log-concave, as log-concave densities are a broad and widely-studied 
class of distributions. The Gaussian density is log-concave, and, in fact, any uniform distribution 
over a convex set is log-concave. 

Kalai et al. [KKMS] give an algorithm for agnostically learning a single halfspace with respect 
to any log-concave distribution in time $n^{f(\epsilon)}$ for some function $f$. The best known bound for $f$
is currently 2°( 1 / £ ' (follows from Section 5 of Lubinsky [Lub]). It is unclear how to extend the 
Kalai et al. analysis to work for the intersection of two halfspaces. 

To summarize, it was not known how to learn the intersection of two halfspaces with respect 
to log-concave distributions even in the noiseless (PAC) model. 

1.2 Statement of Results 

Here we give the first polynomial-time algorithm for agnostically learning intersections (or even 
arbitrary functions) of a constant number of halfspaces with respect to any log-concave distribution 
on $\mathbb{R}^n$ (see Table 1.2 for the precise parameters):

Theorem 1.1. Functions of $m$ halfspaces are agnostically learnable with respect to any log-concave
distribution on $\mathbb{R}^n$ in time $n^{O_{m,\epsilon}(1)}$, where $\epsilon$ is the accuracy parameter.

Admittedly, our dependence on the number of halfspaces $m$ and the error parameter $\epsilon$ is not
great, but we stress that no polynomial-time algorithm was known even for the intersection of two 
halfspaces. See Table 1.2 for a summary of previous work. 

We remark that Daniel Kane, in a forthcoming paper, has independently obtained Theorem 1.1
using a set of completely different techniques [Kan1]. His dependence on $m$ and $1/\epsilon$, though
still exponential, is superior to ours.

We extend the above result, in the smoothed-analysis setting, to hold with respect to arbitrary
distributions with sub-exponential tail bounds. We first define the model of smoothed-complexity
that we consider.

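Definition 1.2. Given a distribution $\mathcal{D}$ on $\mathbb{R}^n$ and a parameter $\sigma \in (0,1)$, let $\mathcal{D}(\sigma)$ be a perturbed
distribution of $\mathcal{D}$ obtained by independently picking $X \leftarrow \mathcal{D}$, $Z \leftarrow \mathcal{N}(0, \Sigma)^n$ and outputting
$X + Z$, where $\Sigma \succeq \sigma^2 \cdot \mathrm{cov}(X)$.$^1$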

That is, $\mathcal{D}(\sigma)$ is obtained by adding Gaussian noise to $\mathcal{D}$; quantitatively, we want the
variance of the noise in any direction to be comparable to (at least $\sigma^2$ times) the variance of $\mathcal{D}$
in the same direction. For instance, for $\mathcal{D}$ isotropic, perturbations by $\mathcal{N}(0, \sigma)^n$ would suffice.

$^1$ Here, $\succeq$ denotes the semi-definite ordering.
Figure 1: Summary of recent work on learning intersections and arbitrary functions of m halfspaces

The latter corresponds more directly to the traditional smoothed-complexity setup, but we use the 
above definition as it is basis independent and allows for non-spherical Gaussian perturbations. 
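The isotropic special case of the perturbation model can be sketched in a few lines. This is only an illustration under stated assumptions: the base distribution (uniform on the cube, which is isotropic) and the helper names `perturb`, `draw_isotropic_rademacher`, and `empirical_variance` are hypothetical choices of ours, and we read the perturbation as independent Gaussian noise with standard deviation sigma per coordinate, so that the noise variance in every direction is sigma^2 times the unit variance of the base distribution.

```python
import random

def draw_isotropic_rademacher(n, rng):
    """Hypothetical isotropic base distribution: uniform on {-1, +1}^n."""
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def perturb(sample, sigma, rng):
    """Return X + Z where Z has i.i.d. Gaussian coordinates with std sigma."""
    return [x + rng.gauss(0.0, sigma) for x in sample]

def empirical_variance(values):
    """Plain (biased) sample variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sigma = 0.5
smoothed = [perturb(draw_isotropic_rademacher(3, rng), sigma, rng)
            for _ in range(20000)]

# Each coordinate of the smoothed sample should have variance close to
# 1 + sigma^2: unit variance from the base distribution plus the noise.
first_coord_var = empirical_variance([s[0] for s in smoothed])
```

After perturbation the sample is no longer supported on the cube: every one-dimensional marginal now has a bounded density, which is the anti-concentration effect that the smoothed-analysis results rely on.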

We define the smoothed-complexity of (agnostically) learning a concept class $\mathcal{C}$ under a distribution
$\mathcal{D}$ to be the complexity of (agnostically) learning $\mathcal{C}$ under the perturbed distributions $\mathcal{D}(\sigma)$.
This model first appears in the work of Blum and Dunagan [BD] (for the special case of spherical
Gaussian perturbations) and we believe it to be a natural and practical extension of the traditional
models of learning. For instance, the main motivating principle behind smoothed-analysis (that
real data involves measurement error) is very much applicable here. Besides the work of Blum
and Dunagan, there seems to be little known about learning in this model.

We say a distribution is sub-exponential (sub-gaussian) if every marginal (i.e., one-dimensional
projection) of the distribution obeys a tail bound of the form $e^{-|z|}$ ($e^{-|z|^2}$, respectively). It is known
that all log-concave distributions are sub-exponential. Sub-exponential and sub-gaussian densities
are commonly studied in machine learning and statistics and model various real-world situations
(see [BK2] for instance). We show that for these types of distributions, our learning algorithms
have polynomial smoothed-complexity (for constant $\sigma$):

Theorem 1.3. Functions of $m$ halfspaces are agnostically learnable with respect to any sub-exponential
distribution on $\mathbb{R}^n$ in time $n^{O_{m,\epsilon,\sigma}(1)}$, where $\epsilon$ is the accuracy parameter and $\sigma$ is the
perturbation parameter.

We obtain much better parameters (in the constant hidden in $O_{m,\epsilon,\sigma}(1)$) for the special case of
sub-gaussian densities (see Theorem 4.2).

Blum and Dunagan were the first to study the smoothed complexity of learning halfspaces. 
They showed that for a single halfspace in the noiseless (in labels) setting, the perceptron algo- 
rithm converges quickly with high probability for examples perturbed by Gaussian noise. Their 
expected running time, however, is infinite (and thus strictly speaking does not give bounds on the 
smoothed-complexity of the Perceptron algorithm). 
To obtain our smoothed-analysis results, we prove that Gaussian perturbations provide enough
anti-concentration for our polynomial approximation methods to work. We believe this connection
will find additional applications related to the smoothed-complexity of learning Boolean functions. 

1.3 Overview of Conceptual and Technical Contributions 

In their seminal paper, Linial et al. [LMN] introduced the polynomial approximation approach 
for learning Boolean functions. The core of their approach is to solve the following optimization 
problem: given a Boolean function $f$, minimize, over all polynomials $p$ of degree at most $d$, the
quantity $\mathbb{E}_{x \in \{-1,1\}^n}[(f - p)^2]$.

The algorithm is given uniformly random samples of the form $(x, f(x))$. Their "low-degree"
algorithm approximately solves this optimization problem in time roughly $n^{O(d)}$. Later, the
"sparse" algorithm of Kushilevitz and Mansour [KM2] solved the same optimization problem
but where the minimization is over all sparse polynomials, and the algorithm is allowed query
access to the function $f$. These algorithms were developed in the context of PAC learning.

Kalai et al. [KKMS] subsequently observed that in order to succeed in the agnostic framework 
of learning (we formally define agnostic learning in Section 2.1 but for now agnostic learning can 
be thought of as a model of PAC learning with adversarial noise), it suffices to approximately 
minimize $\mathbb{E}_{x \in \{-1,1\}^n}[|f - p|]$.

That is, minimizing with respect to the 1-norm rather than the 2-norm results in highly noise-tolerant
learning algorithms. Finding efficient algorithms for directly minimizing the above expectation
with respect to the 1-norm ("$\ell_1$ minimization"), however, is more challenging than in the
$\ell_2$ case. The work of Kalai et al. [KKMS] gives the analogue of the "low-degree" algorithm for
$\ell_1$ minimization (in fact, their algorithm can be carried out using a Support Vector Machine with
the polynomial kernel), and the work of Gopalan et al. [GKK] gives the analogue of the "sparse"
algorithm for $\ell_1$ minimization.

Although we have efficient algorithms that directly carry out $\ell_1$ minimization for low-degree
polynomials, proving the existence of good low-degree $\ell_1$ approximators has required first finding
a good low-degree $\ell_2$ approximator (i.e., Fourier polynomial) and then applying the simple fact
that $\mathbb{E}[|p|] \le \sqrt{\mathbb{E}[p^2]}$. Directly analyzing the error of low-degree $\ell_1$ approximators seems quite
difficult. In our setting, for example, it is not even clear that the best low-degree $\ell_1$ polynomial
approximator is unique!

The main conceptual contribution of our methods is to provide the first framework for directly
proving the existence of low-degree $\ell_1$ approximating polynomials for Boolean functions
(in fact, we also obtain sandwiching polynomials). One benefit of our approach is that we do not 
require the underlying distribution to be product (essentially all of the techniques involving the 
discrete Fourier polynomial require some sort of product structure). As such, in this work, we 
are able to reason about approximating Boolean functions with respect to interesting non-product 
distributions, such as log-concave densities. 

In the following descriptions, we assume we are trying to show polynomial approximations
for $f : \mathbb{R}^n \to \{0,1\}$, where $f = g(h_1(x), \ldots, h_m(x))$, where $g : \{0,1\}^m \to \{0,1\}$ is an arbitrary
Boolean function and $h_1, \ldots, h_m : \mathbb{R}^n \to \{0,1\}$ are halfspaces.
1.4 A "Moment-Matching" Proof



Our method uses ideas from probability theory and linear programming to give a framework for
proving the existence of sandwiching polynomials (it is easy to see that sandwiching polynomials
are stronger than $\ell_1$ approximators). The main technical contribution is to show how to use a set
of powerful theorems from the study of the classical moment problem to apply our framework
to functions of halfspaces. At a high level, our approach makes crucial use of the following
consequence of strong duality for semi-infinite linear programs: let $\mathcal{D}$ be a distribution and let $\mathcal{D}_k$
be any distribution where all moments of order less than or equal to $k$ match those of $\mathcal{D}$. If $\mathbb{E}_{\mathcal{D}}[f]$
is "close" to $\mathbb{E}_{\mathcal{D}_k}[f]$, then $f$ has low-degree sandwiching polynomials with respect to $\mathcal{D}$. The
question then becomes how to analyze the bias of a Boolean function where only the low-order
moments of a distribution have been specified. We show how to use several deep results from
probability to answer this question in Sections 3.2 and 3.3.

We show that the moment-matching approach also has some interesting applications for learning
with respect to distributions on the discrete cube $\{-1,+1\}^n$.

2 Preliminaries 

2.1 Agnostic Learning 

We recall the model of agnostically learning a concept class $\mathcal{C}$ [Hau], [KSS]. In this scenario there
is an unknown distribution $\mathcal{D}$ over $\mathbb{R}^n \times \{-1,1\}$ with marginal distribution over $\mathbb{R}^n$ denoted $\mathcal{D}_X$.
Let $\mathrm{opt} \stackrel{\mathrm{def}}{=} \inf_{f \in \mathcal{C}} \Pr_{(x,y) \sim \mathcal{D}}[f(x) \ne y]$; i.e., opt is the minimum error of any function from $\mathcal{C}$ in
predicting the labels $y$. The learner must output a hypothesis whose error is within $\epsilon$ of opt:

Definition 2.1. Let $\mathcal{D}$ be an arbitrary distribution on $\mathbb{R}^n \times \{-1,1\}$ whose marginal over $\mathbb{R}^n$ is
$\mathcal{D}_X$, and let $\mathcal{C}$ be a class of Boolean functions $f : \mathbb{R}^n \to \{-1,1\}$. We say that algorithm $B$
is an agnostic learning algorithm for $\mathcal{C}$ with respect to $\mathcal{D}$ if the following holds: for any $\mathcal{D}$ as
described above, if $B$ is given access to a set of labeled examples $(x, y)$ drawn from $\mathcal{D}$, then
with probability at least $1 - \delta$ algorithm $B$ outputs a hypothesis $h : \mathbb{R}^n \to \{-1,1\}$ such that

$$\Pr_{(x,y) \sim \mathcal{D}}[h(x) \ne y] \le \mathrm{opt} + \epsilon.$$

Note that PAC learning is a special case of agnostic learning (the case when opt = 0). 

The "$L_1$ Polynomial Regression Algorithm" due to Kalai et al. [KKMS] shows that one can
agnostically learn any concept class that can be approximated by low-degree polynomials (in
Kalai et al. [KKMS] it is shown how to implement this algorithm using a standard SVM with the
polynomial kernel):

Theorem 2.2 ([KKMS]). Fix $\mathcal{D}$ on $X \times \{-1,1\}$ and let $f \in \mathcal{C}$. Assume there exists a polynomial $p$ of
degree $d$ such that $\mathbb{E}_{x \sim \mathcal{D}_X}[|f(x) - p(x)|] \le \epsilon$, where $\mathcal{D}_X$ is the marginal distribution on $X$. Then,
with probability $1 - \delta$, the $L_1$ Polynomial Regression Algorithm outputs a hypothesis $h$ such that
$\Pr_{(x,y) \sim \mathcal{D}}[h(x) \ne y] \le \mathrm{opt} + \epsilon$ in time $\mathrm{poly}(n^d/\epsilon, \log(1/\delta))$.

Throughout, we suppress the $\mathrm{poly}(\log(1/\delta))$ dependence on $\delta$.
2.2 Probability

For a random variable $X \in \mathbb{R}^m$, let $\varphi_X : \mathbb{R}^m \to \mathbb{C}$ be the characteristic function defined by
$\varphi_X(t) = \mathbb{E}[\exp(-i\langle t, X\rangle)]$, where $i = \sqrt{-1}$.

We shall use the following standard distance measures between random variables $X, Y \in \mathbb{R}^m$.

• The $\lambda$-metric:

$$d_\lambda(X, Y) = \min_{T > 0} \max\Big\{ \max_{\|t\| \le T} |\varphi_X(t) - \varphi_Y(t)|,\ 1/T \Big\}.$$

• The Levy distance: for $\mathbf{1}$ being the all 1's vector,

$$d_{LV}(X, Y) = \inf\{\epsilon > 0 : \forall t \in \mathbb{R}^m,\ \Pr[X < t - \epsilon\mathbf{1}] - \epsilon < \Pr[Y < t] < \Pr[X < t + \epsilon\mathbf{1}] + \epsilon\}.$$

• Kolmogorov-Smirnov or cdf distance:

$$d_{cdf}(X, Y) = \sup_{t \in \mathbb{R}^m} |\Pr[X > t] - \Pr[Y > t]|.$$

For $I = (i_1, \ldots, i_n) \in \mathbb{Z}^n$ and $s \in \mathbb{R}^n$, let $s(I) = \prod_{j=1}^n s_j^{i_j}$. For $k > 0$, let
$I(k, n) = \{I = (i_1, \ldots, i_n) \in \mathbb{Z}^n : \sum_{j=1}^n i_j \le k,\ i_j \ge 0\}$.
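To make the multi-index notation concrete, the following small sketch enumerates the set I(k, n) and evaluates the monomial s(I). The function names `multi_indices` and `monomial` are our own, chosen for illustration; the brute-force enumeration is only meant for tiny k and n.

```python
from itertools import product
from math import comb, prod

def multi_indices(k, n):
    """All I = (i_1, ..., i_n) with nonnegative integer entries and sum <= k."""
    return [I for I in product(range(k + 1), repeat=n) if sum(I) <= k]

def monomial(s, I):
    """Evaluate s(I), the product over j of s_j raised to the power i_j."""
    return prod(x ** i for x, i in zip(s, I))

indices = multi_indices(2, 3)

# The number of monomials of total degree at most k in n variables
# is the binomial coefficient C(n + k, k).
count_matches = len(indices) == comb(3 + 2, 2)

# Example: s = (2, 3, 5) and I = (1, 0, 1) gives 2^1 * 3^0 * 5^1.
value = monomial((2, 3, 5), (1, 0, 1))
```

The count C(n + k, k) is at most n^{O(k)}, which is where the n^d factor in the running time of degree-d polynomial regression comes from.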

We say that a class of functions $\mathcal{C}$ is $\epsilon$-approximated in $\ell_1$ by polynomials of degree $d$ under
a distribution $\mathcal{D}$ if for every $f \in \mathcal{C}$, there exists a degree $d$ polynomial $p$ such that
$\mathbb{E}_{x \sim \mathcal{D}}[|p(x) - f(x)|] \le \epsilon$.

We use the following properties of log-concave distributions (equivalent formulations can be 
found in Lovasz-Vempala [LV]). 

Theorem 2.3 ([CW]). Let random variable $X \in \mathbb{R}^n$ be drawn from a log-concave distribution.
Then, for every $w \in \mathbb{R}^n$ and $r > 0$, $\mathbb{E}[|\langle w, X\rangle|^r] \le r^r \cdot \mathbb{E}[\langle w, X\rangle^2]^{r/2}$.

Theorem 2.4 ([CW]). There exists a universal constant $C$ such that the following holds. For any
real-valued log-concave random variable $X$ with $\mathbb{E}[X^2] = 1$ and all $t \in \mathbb{R}$, $\epsilon > 0$,
$\Pr[X \in [t, t + \epsilon]] \le C\epsilon$.

We also use the following simple lemmas. The first helps us convert closeness in Levy distance 
to closeness in cdf distance, while the second helps us go from fooling intersections of halfspaces 
to fooling arbitrary functions of halfspaces. 

Fact 2.5. Let $X = (X_1, \ldots, X_m) \in \mathbb{R}^m$ be a random variable such that for every $r \in [m]$,
$t \in \mathbb{R}$, $\epsilon > 0$, $\Pr[X_r \in [t, t + \epsilon]] \le \beta \cdot \epsilon$ for a fixed $\beta > 0$. Then, for any random variable $Y$,
$d_{cdf}(X, Y) \le m \cdot \beta \cdot d_{LV}(X, Y)$.

Lemma 2.6. Let $X, Y \in \mathbb{R}^m$ be real-valued random variables such that for every $a_1, \ldots, a_m \in \{1,-1\}$,
$d_{cdf}((a_1X_1, a_2X_2, \ldots, a_mX_m), (a_1Y_1, a_2Y_2, \ldots, a_mY_m)) \le \epsilon$. Then, for any function
$g : \{1,-1\}^m \to \{1,-1\}$ and thresholds $\theta_1, \ldots, \theta_m$,
$|\mathbb{E}[g(\mathrm{sign}(X_1 - \theta_1), \ldots, \mathrm{sign}(X_m - \theta_m))] - \mathbb{E}[g(\mathrm{sign}(Y_1 - \theta_1), \ldots, \mathrm{sign}(Y_m - \theta_m))]| \le 2^m \epsilon$.

Proof. Fix $\theta_1, \ldots, \theta_m$ and let $X' = (\mathrm{sign}(X_1 - \theta_1), \ldots, \mathrm{sign}(X_m - \theta_m))$ and define $Y'$ similarly.
Then, from the assumptions of the lemma, for every $a \in \{1,-1\}^m$,

$$|\Pr[X' = a] - \Pr[Y' = a]| \le d_{cdf}((a_1X_1, a_2X_2, \ldots, a_mX_m), (a_1Y_1, a_2Y_2, \ldots, a_mY_m)) \le \epsilon.$$

Therefore, $d_{TV}(X', Y') \le 2^{m-1}\epsilon$. The lemma now follows. □
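The last two steps of the proof can be spelled out as follows. This is a short derivation of ours that uses only the definition of total variation distance, the fact that there are $2^m$ sign patterns $a$, and the fact that $g$ takes values in $\{1,-1\}$:

```latex
\begin{align*}
d_{TV}(X', Y')
  &= \tfrac{1}{2} \sum_{a \in \{1,-1\}^m} \bigl|\Pr[X' = a] - \Pr[Y' = a]\bigr|
   \le \tfrac{1}{2} \cdot 2^m \cdot \epsilon = 2^{m-1}\epsilon, \\
\bigl|\mathbb{E}[g(X')] - \mathbb{E}[g(Y')]\bigr|
  &\le \sum_{a \in \{1,-1\}^m} |g(a)| \cdot \bigl|\Pr[X' = a] - \Pr[Y' = a]\bigr|
   = 2\, d_{TV}(X', Y') \le 2^m \epsilon.
\end{align*}
```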
3 Moment-Matching Polynomials



We develop a theory of "moment-matching polynomials" for showing the existence of good ap- 
proximating polynomials. Our main result is the following. 

Theorem 3.1. Let $\mathcal{D}$ be a log-concave distribution over $\mathbb{R}^n$. Let $h_1, \ldots, h_m : \mathbb{R}^n \to \{1,-1\}$ be
halfspaces and let $g : \{1,-1\}^m \to \{1,-1\}$ be an arbitrary function. Define $f : \mathbb{R}^n \to \{1,-1\}$
by $f(x) = g((h_1(x), \ldots, h_m(x)))$. Then, there exists a real-valued polynomial $P$ of degree at
most $k = \exp\big((\log((\log m)/\epsilon))^{O(m)}/\epsilon^4\big)$ such that $\mathbb{E}_{X \sim \mathcal{D}}[|f(X) - P(X)|] \le \epsilon$.

Theorem 1.1 with the runtime given in Table 1.2 follows from the above result and Theo- 
rem 2.2. The theorem is proved in Section 3.3. We start by describing the two basic ingredients: 
LP-duality and the classical moment problem. 

3.1 LP Duality 

It is now well known in the pseudorandomness literature that with respect to the uniform distribution
over $\{-1,1\}^n$, a concept class $\mathcal{C}$ has degree $k$ sandwiching polynomials if and only if $\mathcal{C}$
is fooled by $k$-wise independent distributions [Baz]. The proof of this fact follows from LP duality
where feasible solutions to the primal are $k$-wise independent distributions and feasible dual
solutions are approximating polynomials.

In our setting, we consider continuous distributions over $\mathbb{R}^n$ that are not necessarily product.
As such, this equivalence is more subtle. In fact, it is not even clear how to define $k$-wise independence
for non-product distributions (such as log-concave densities). Still, given a distribution $\mathcal{D}$
we can write a semi-infinite linear program (a program with infinitely many variables but finitely
many constraints) whose feasible solutions are distributions that match all of $\mathcal{D}$'s moments up to
degree $k$ (in the case where $\mathcal{D}$ is uniform over $\{-1,1\}^n$, matching all moments is equivalent to
being $k$-wise independent).

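For $I \in I(k, n)$, let $\sigma_I = \mathbb{E}_{X \leftarrow \mathcal{D}}[X(I)]$. Let $f \in \mathcal{C}$. We write the primal program as follows:

$$\sup_{\mu} \int_{\mathbb{R}^n} f(x)\, d\mu(x) \quad \text{subject to} \quad \int_{\mathbb{R}^n} x(I)\, d\mu(x) = \sigma_I \ \text{ for all } I \in I(k,n). \tag{3.1}$$

The supremum is over all probability measures on $\mathbb{R}^n$. As in the finite dimensional case, feasible
solutions to the dual program correspond to degree $k$ approximating polynomials. The dual can
be written as

$$\inf_{a} \sum_{I \in I(k,n)} a_I \sigma_I \tag{3.2}$$

$$\text{subject to} \quad \sum_{I \in I(k,n)} a_I\, x(I) \ge f(x) \ \text{ for all } x \in \mathbb{R}^n. \tag{3.3}$$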
The issue here is that in general, strong duality does not hold for semi-infinite linear programs. In
our case, however, where the $\sigma_I$'s are obtained as moments from a distribution $\mathcal{D}$ (as opposed to
just arbitrary reals), it turns out that strong duality does hold. To see this, we note that the above
primal LP is a special case of the so-called generalized moment problem LP, a classical problem
from probability and analysis that asks if there exists a multivariate distribution with moments
specified by the $\sigma_I$'s. In our case, feasibility is immediate, as the $\sigma_I$'s are obtained from $\mathcal{D}$.

As for strong duality, it is known that if the $\sigma_I$'s are in the interior of a particular set (the details
are not relevant here), then the optimal value of the primal equals the optimal value of the dual. In
the case that the $\sigma_I$'s do not satisfy this condition, strong duality holds assuming we relax the dual
program constraints to some subset $\Omega \subseteq \mathbb{R}^n$. One concern is that we will now obtain an optimal
approximating polynomial with respect to some distribution $\mathcal{D}'$ defined on $\Omega$ (as opposed to the
original $\mathcal{D}$). But it is also known that in this case, all feasible distributions are supported on $\Omega$. As
such, approximation with respect to $\mathcal{D}'$ is equivalent to approximation with respect to $\mathcal{D}$. We refer
the reader to Bertsimas and Popescu [BP] (Section 2) for more details and references.

We start with an important definition. 

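Definition 3.2. Given two distributions $\mathcal{D}, \mathcal{D}'$ on $\mathbb{R}^n$ and $k > 0$, we say $\mathcal{D}'$ $k$ moment-matches $\mathcal{D}$ if
for all $I \in I(k, n)$, $\mathbb{E}_{X \leftarrow \mathcal{D}'}[X(I)] = \mathbb{E}_{X \leftarrow \mathcal{D}}[X(I)]$.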
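As a toy illustration of this definition on the discrete cube (our own example, not taken from the paper), the sketch below compares the uniform distribution on {-1, 1}^3 with the distribution that draws the first two coordinates uniformly and sets the third to their product. The helper names `moment` and `moment_matches` are hypothetical; exact rational arithmetic is used so that equality of moments can be tested without rounding error.

```python
from fractions import Fraction
from itertools import product
from math import prod

def multi_indices(k, n):
    """All exponent vectors with nonnegative entries summing to at most k."""
    return [I for I in product(range(k + 1), repeat=n) if sum(I) <= k]

def moment(dist, I):
    """E[X(I)] for a finite distribution given as {point: probability}."""
    return sum(p * prod(x ** i for x, i in zip(pt, I)) for pt, p in dist.items())

def moment_matches(d1, d2, k, n):
    """True if d1 and d2 agree on every moment of total degree at most k."""
    return all(moment(d1, I) == moment(d2, I) for I in multi_indices(k, n))

cube = list(product((-1, 1), repeat=3))
uniform = {pt: Fraction(1, 8) for pt in cube}
# Pairwise independent but not 3-wise independent: x3 is forced to be x1 * x2.
parity = {pt: Fraction(1, 4) for pt in cube if pt[2] == pt[0] * pt[1]}

matches_degree_2 = moment_matches(uniform, parity, 2, 3)
matches_degree_3 = moment_matches(uniform, parity, 3, 3)
```

The two distributions agree on all moments of degree at most 2 but differ on the degree-3 moment of x1 * x2 * x3, which mirrors the remark that on the cube moment matching up to degree k coincides with k-wise independence.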

We can now prove the main lemma of this section: 

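Lemma 3.3. Let $f : \mathbb{R}^n \to \{0,1\}$ and let $\mathcal{D}$ be a distribution over $\mathbb{R}^n$ with all moments finite such
that the following holds: for every distribution $\mathcal{D}'$ that $k$ moment-matches $\mathcal{D}$,
$|\mathbb{E}_{X \leftarrow \mathcal{D}}[f(X)] - \mathbb{E}_{X \leftarrow \mathcal{D}'}[f(X)]| \le \epsilon$. Then, there exist degree at most $k$ polynomials $P_\ell, P_u : \mathbb{R}^n \to \mathbb{R}$ such that

• For every $x \in \mathrm{Support}(\mathcal{D})$, $P_\ell(x) \le f(x) \le P_u(x)$.

• For $X \leftarrow \mathcal{D}$, $\mathbb{E}[P_u(X)] - \mathbb{E}[f(X)] \le \epsilon$ and $\mathbb{E}[f(X)] - \mathbb{E}[P_\ell(X)] \le \epsilon$.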

Proof. Let $\mathrm{opt}^*$ be the value of the primal program Equation 3.1. Then, by hypothesis, $\mathrm{opt}^* \le \gamma + \epsilon$,
where $\gamma = \mathbb{E}_{X \leftarrow \mathcal{D}}[f(X)]$.

Now, from the above discussion, strong duality (almost) holds for the programs in Equations
(3.1) and (3.2), and we conclude that there exists a dual solution $a \in \mathbb{R}^{I(k,n)}$ with value exactly
$\mathrm{opt}^*$ that satisfies the inequality constraints for all $x \in \mathrm{Support}(\mathcal{D})$. Define

$$P_u(x_1, \ldots, x_n) = \sum_{I \in I(k,n)} a_I\, x(I).$$

Then, $P_u$ is a polynomial of degree at most $k$, and $P_u(x) \ge f(x)$ for every $x \in \mathrm{Support}(\mathcal{D})$.
Further, the assumption in the lemma implies $\mathbb{E}_{X \leftarrow \mathcal{D}}[P_u(X)] = \sum_{I \in I(k,n)} a_I \sigma_I = \mathrm{opt}^* \le \gamma + \epsilon$.
We have the existence of the lower sandwiching polynomial $P_\ell$ similarly. □

3.2 The Classical Moment Problem 

In the previous section, we reduced the problem of constructing low-degree sandwiching polynomial
approximators with respect to $\mathcal{D}$ to understanding the optimal value of a semi-infinite linear
program. The feasible solutions of the linear program correspond to all distributions that are $k$
moment-matching to $\mathcal{D}$. As such, for any $k$ moment-matching distribution $\mathcal{D}'$ we need to bound
$|\mathbb{E}_{\mathcal{D}}[f] - \mathbb{E}_{\mathcal{D}'}[f]|$. In this section, we give some technical results that help us bound this difference
provided the moments of $\mathcal{D}$ do not grow too fast.

We begin with the following result showing that multivariate distributions whose marginals
have matching lower order moments have close characteristic functions (as quantified by the $\lambda$-metric)
provided the moments are well behaved.

Theorem 3.4 (Theorem 2, Page 171, [KR]). Let $X, Y \in \mathbb{R}^m$ be two random variables such that
for any $t \in \mathbb{R}^m$, the real-valued random variables $\langle t, X\rangle$, $\langle t, Y\rangle$ have identical first $2k$ moments.
Then, for a universal constant $C$,

d x (X,Y)<Cf3- 1/4 (l + ^X) 1 /*), 

where $\mu_j(X) = \sup\{\mathbb{E}[|\langle t, X\rangle|^j] : t \in \mathbb{R}^m, \|t\| \le 1\}$, and $\beta_k = \beta_k(X) = \sum_{j=1}^{k} 1/\mu_{2j}(X)^{1/2j}$.

We now need to convert the above bound on closeness of characteristic functions to more 
direct measures of closeness like Levy or Kolmogorov-Smirnov metrics. Such inequalities play 
an important role in Fourier theoretic proofs of limit theorems (e.g., Esseen's inequality; cf. Chapter
XVI [Fel]) and here we use the following multi-dimensional version due to Gabovich [Gab]. 

Theorem 3.5 ([Gab] Equation (8)). Let $X, Y \in \mathbb{R}^m$ be two vector-valued random variables.
Then, for a universal constant $C$ and all sufficiently large $N, T > 0$,

d LV (X,Y)< / C ^ x{{h -- tm)) -^-- t ^ d t 1 ...dt m+ 

C(logT)(\og{NT)) + p ^ x ^ ^ N ^ + p( _^ ^ ^ 

The above theorem leads to the following concrete relation between $d_\lambda$ and $d_{LV}$.

Lemma 3.6. Let $X, Y$ be two vector-valued random variables with $d_\lambda(X, Y) \le \delta$. Let $N(\delta) \in \mathbb{R}$
be such that $\Pr[X \notin [-N(\delta), N(\delta)]^m], \Pr[Y \notin [-N(\delta), N(\delta)]^m] \le \delta$. Then,

$$d_{LV}(X, Y) \le O\big((\log N(\delta) + 2\log(1/\delta))^m \cdot \delta\big).$$

Proof. Without loss of generality suppose that $\delta < 1/m^2$, as else the statement is trivial. Let $T^*$
be the value of $T$ that attains the minimum in the definition of $d_\lambda$:

$$d_\lambda(X, Y) = \max\Big\{ \max_{\|t\| \le T^*} |\varphi_X(t) - \varphi_Y(t)|,\ 1/T^* \Big\}.$$

As $d_\lambda(X, Y) \le \delta$, $T^* \ge 1/\delta$. Therefore, for every $t \in \mathbb{R}^m$ with $\|t\| \le 1/\delta$, $|\varphi_X(t) - \varphi_Y(t)| \le \delta$.
Thus, applying Theorem 3.5 with $N = N(\delta)$ and $T = 1/(\delta\sqrt{m})$, we get

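$$
\begin{aligned}
d_{LV}(X, Y) &\le C \int_{\frac{1}{NT} \le t_1, \ldots, t_m \le T} \frac{|\varphi_X(t) - \varphi_Y(t)|}{t_1 \cdots t_m}\, dt + O(\log^2(NT) \cdot \delta\sqrt{m}) + O(\delta) \\
&\le C\delta \int_{\frac{1}{NT} \le t_1, \ldots, t_m \le T} \frac{dt}{t_1 \cdots t_m} + O(\log^2(NT) \cdot \delta\sqrt{m}) \\
&\le C\delta \cdot (\log N + 2\log T)^m + O(\log^2(NT) \cdot \delta\sqrt{m}) \\
&= O\big((\log N(\delta) + 2\log(1/\delta))^m \cdot \delta\big).
\end{aligned}
$$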

□ 
3.3 Low-Order Moments, Functions of Halfspaces, and Log-Concave Densities

We are now ready to complete the proof of the main theorem for learning functions of halfspaces 
with respect to log-concave distributions - Theorem 1.1. We do so by using the tools from the 
previous section on moment bounds to analyze the optimum value of the primal LP from Section 
3. This will imply low-degree $\ell_1$ approximators with low error for any $f \in \mathcal{C}$. We can then apply
known results due to Kalai et al. [KKMS] (Theorem 2.2) relating approximability by low-degree 
polynomials and agnostic learning. 

Proof of Theorem 3.1. Without loss of generality suppose that $\mathcal{D}$ is in isotropic position. We can
do so, as any distribution can be brought to isotropic position by an affine transformation and the 
class of intersections of halfspaces is invariant under affine transformations. 

Let halfspace $h_i : \mathbb{R}^n \to \{1,-1\}$ be given as $h_i(x) = \mathrm{sign}(\langle w_i, x\rangle - \theta_i)$ for $w_i \in \mathbb{R}^n$ with
$\|w_i\| = 1$ and $\theta_i \in \mathbb{R}$.

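Let $X \leftarrow \mathcal{D}$ and let $X' \leftarrow \mathcal{D}'$, where $\mathcal{D}'$ is any distribution that is $2k$-moment matching to
$\mathcal{D}$ for $k = 2^{O((m/\epsilon)^4)}$ to be chosen later. Let $Y = (\langle w_1, X\rangle, \langle w_2, X\rangle, \ldots, \langle w_m, X\rangle)$ and $Y' =
(\langle w_1, X'\rangle, \ldots, \langle w_m, X'\rangle)$. Observe that for every $t \in \mathbb{R}^m$, the first $2k$ moments of $\langle t, Y\rangle$, $\langle t, Y'\rangle$
are identical. Thus, we can apply Theorem 3.4 to the random variables $Y, Y'$. For $t \in \mathbb{R}^m$,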

||*|| = 1,3 > 0, 

m mm 

n\{t,Y)\ j ) = E[\(J2t r Wr,X}\i] < ji-nCEtrW r ,X) 2 y/ 2 = f • Wj^trWrf <fm j , 

r=l r=l r=l 

where the first inequality follows from Theorem 2.3 and the second equality from X being isotropic. 
Therefore, for Hj(Y) and Pk as defined in Theorem 3.4, 

k k 

a = £^^££^ = "((i°^>/™>- 0.4) 

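We now wish to get a good estimate on $N(\delta)$ as defined in Lemma 3.6. From Theorem 2.3 and
Markov's inequality, for every $\alpha > 0$, $r \in [m]$, and even $j \le 2k$,

$$\Pr[|\langle w_r, X\rangle| > \alpha] \le \frac{\mathbb{E}[|\langle w_r, X\rangle|^j]}{\alpha^j} \le \frac{j^j}{\alpha^j}.$$

Therefore, for $j = \log(m/\delta)$ and $\alpha = 2j$, $\Pr[|\langle w_r, X\rangle| > 2j] \le \delta/m$. Thus, by using a union
bound over all the components of $Y$, for $N = 2j = 2\log(m/\delta)$,

$$\Pr[Y \notin [-N, N]^m] \le \delta. \tag{3.5}$$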

As the above calculation only involved the first 2k moments of X, the same property should hold 
for Y' . From Equations (3.4), (3.5) and Theorem 3.4, 
Let $k = 2^{O(m/\delta^4)}$ be large enough so that the above error bound is $d_\lambda(Y, Y') \le \delta$. Therefore,
from Lemma 3.6,

$$d_{LV}(Y, Y') \le (\log((\log m)/\delta))^{O(m)} \cdot \delta.$$

Now observe that by Theorem 2.4, for every $r \in [m]$, $t \in \mathbb{R}$, $\alpha > 0$, $\Pr[Y_r \in [t, t + \alpha]] = O(\alpha)$.
Thus, from the above equation and Fact 2.5,

$$d_{cdf}(Y, Y') \le O(m \cdot d_{LV}(Y, Y')) = (\log((\log m)/\delta))^{O(m)} \cdot \delta = \epsilon. \tag{3.7}$$

Since the above argument worked for any weight vectors $w_1, \ldots, w_m \in \mathbb{R}^n$, a similar argument
applied to weight vectors $a_1w_1, a_2w_2, \ldots, a_mw_m$ for $a \in \{1,-1\}^m$ gives

$$d_{cdf}((a_1Y_1, \ldots, a_mY_m), (a_1Y_1', \ldots, a_mY_m')) \le \epsilon.$$

Therefore, by Lemma 2.6 applied to $Y$, $Y'$ and $g$, $|\mathbb{E}[f(X)] - \mathbb{E}[f(X')]| \le 2^m \epsilon$.

Hence, by Lemma 3.3, for $P = P_u$ a degree at most $k$ polynomial as in Lemma 3.3,

$$\mathbb{E}[|P(X) - f(X)|] = \mathbb{E}[P(X)] - \mathbb{E}[f(X)] \le 2^m \epsilon.$$

The theorem now follows from setting $\epsilon = \epsilon'/2^m$, as $k = 2^{O(m/\delta^4)} = 2^{(\log((\log m)/\epsilon))^{O(m)}/\epsilon^4}$. □

4 Smoothed Complexity of Learning Functions of Halfspaces 

We now consider the smoothed complexity of learning convex sets defined by intersections of 
halfspaces and extend our learning results to handle any distribution whose marginals obey a 
subexponential tail bound. We feel this is a mild restriction to place on the distribution. It is well 
known that any (isotropic) log-concave distribution obeys such a tail bound. 

Our high level approach will be similar to that for log-concave densities: we use moment 
bounds and results from Section 3.2 to show that functions of halfspaces cannot distinguish 
(smoothed) distributions with strong moment bounds. Adding a Gaussian perturbation plays an 
important role in our setting, by essentially allowing us to impose certain probabilistic margin 
constraints in the form of anti-concentration bounds. One interpretation of our results is that in the 
setting of smoothed-analysis, learning geometric classes becomes easier in many cases because the 
underlying Gaussian perturbation makes the distribution anti-concentrated (i.e., no sharp peaks) 
"for free." 

We first state our results and defer the proofs to the following sections. 

Theorem 4.1. Let $\mathcal{D}$ be a sub-exponential distribution over $\mathbb{R}^n$. Let $h_1, \ldots, h_m : \mathbb{R}^n \to \{1,-1\}$
be halfspaces and let $g : \{1,-1\}^m \to \{1,-1\}$ be an arbitrary function. Define $f : \mathbb{R}^n \to \{1,-1\}$
by $f(x) = g(h_1(x), \ldots, h_m(x))$. Then, for every $\sigma > 0$, there exists a polynomial $P$ of degree
at most

$$k = \exp\left((\log((\log m)/\sigma\varepsilon))^{O(m)}/(\sigma\varepsilon)^4\right)$$

such that $E_{X \leftarrow \mathcal{D}(\sigma)}[\,|f(X) - P(X)|\,] \le \varepsilon$.

Theorem 4.2. Let $\mathcal{D}$ be a sub-Gaussian distribution over $\mathbb{R}^n$. Let $h_1, \ldots, h_m : \mathbb{R}^n \to \{1,-1\}$ be
halfspaces and let $g : \{1,-1\}^m \to \{1,-1\}$ be an arbitrary function. Define $f : \mathbb{R}^n \to \{1,-1\}$
by $f(x) = g(h_1(x), \ldots, h_m(x))$. Then, for every $\sigma > 0$, there exists a real-valued polynomial $P$
of degree at most $k = (\log((\log m)/\sigma\varepsilon))^{O(m)}/(\sigma\varepsilon)^4$ such that $E_{X \leftarrow \mathcal{D}(\sigma)}[\,|f(X) - P(X)|\,] \le \varepsilon$.

Theorem 1.3 and the precise runtimes as given in Table 1.2 follow from the above results and
Theorem 2.2.

4.1 Sub-Exponential Densities 

In this section we study sub-exponential densities and prove Theorem 4.1.

Definition 4.3. We say an isotropic distribution $\mathcal{D}$ on $\mathbb{R}^n$ is sub-exponential if there exist constants
$C, \alpha > 0$ such that for every $w \in \mathbb{R}^n$, $\|w\| = 1$, and $t > 0$,

$$\Pr[\,|\langle w, X\rangle| > t\,] \le C \exp(-\alpha t).$$

More generally, we say a distribution $\mathcal{D}$ on $\mathbb{R}^n$ is sub-exponential if the isotropic distribution
obtained by putting $\mathcal{D}$ in an isotropic position by an affine transformation is sub-exponential.

We shall use the following standard fact giving strong moment bounds for random variables 
with sub-exponential tails. 

Fact 4.4. Let $X$ be a unit variance random variable such that $\Pr[|X| > t] \le C \exp(-\alpha t)$. Then,
for all $k > 0$, $E[|X|^k] \le C(k/\alpha)^k$.
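As a quick sanity check of Fact 4.4, the sketch below looks at one assumed example: a variable whose tail is exactly $\exp(-\alpha t)$ (so $C = 1$), such as a unit-variance Laplace variable with $\alpha = \sqrt{2}$. For such a tail the $k$-th absolute moment is $k!/\alpha^k$, which sits below $(k/\alpha)^k$. The function names are illustrative and do not come from the paper.

```python
import math


def exponential_tail_moment(k: int, alpha: float) -> float:
    """Exact E[|X|^k] when Pr[|X| > t] = exp(-alpha * t) for all t >= 0.

    In that case |X| is Exponential(alpha), whose k-th moment is k! / alpha^k.
    """
    return math.factorial(k) / alpha ** k


def fact_4_4_bound(k: int, alpha: float, C: float = 1.0) -> float:
    """Right-hand side of Fact 4.4: C * (k / alpha)^k."""
    return C * (k / alpha) ** k


# Assumed example: a Laplace variable with unit variance has tail exp(-sqrt(2) * t).
ALPHA = math.sqrt(2.0)

if __name__ == "__main__":
    for k in range(1, 11):
        print(k, exponential_tail_moment(k, ALPHA), fact_4_4_bound(k, ALPHA))
```

The gap between the two columns grows with $k$, reflecting that $k! \le k^k$ is loose for large $k$.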

Finally, we need the following fact showing that convolving any distribution with a Gaussian 
distribution leads to anti-concentration. 

Fact 4.5. For any real-valued random variable $X$, $Z \leftarrow \mathcal{N}(0,\sigma)$ and $t \in \mathbb{R}$, $\alpha > 0$,
$\Pr[X + Z \in [t, t+\alpha)] \le C\alpha/\sigma$, where $C$ is a universal constant.

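Proof. Fix $t \in \mathbb{R}$ and $\alpha > 0$. Then,

$$\Pr[Z \in [t, t+\alpha)] = \int_t^{t+\alpha} \frac{e^{-x^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}\, dx \le \int_t^{t+\alpha} \frac{1}{\sqrt{2\pi\sigma^2}}\, dx = C\alpha/\sigma,$$

where $C = 1/\sqrt{2\pi}$. The claim now follows from:

$$\Pr[X + Z \in [t, t+\alpha)] = E_X\left[\Pr_Z[Z \in [(t - X), (t - X) + \alpha)]\right] \le E_X[C\alpha/\sigma] \le C\alpha/\sigma.$$

□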
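The bound in Fact 4.5 can be checked numerically in a simple assumed setting. The sketch below takes $X$ uniform over a small finite set of atoms (a hypothetical choice, since the fact holds for any $X$), treats $\sigma$ as the standard deviation of $Z$ as the density in the proof suggests, and evaluates $\Pr[X + Z \in [t, t+\alpha)]$ exactly through the Gaussian CDF.

```python
import math


def gaussian_interval_prob(t: float, alpha: float, sigma: float) -> float:
    """Pr[Z in [t, t + alpha)] for Z Gaussian with mean 0 and standard deviation sigma."""
    def cdf(x: float) -> float:
        return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))
    return cdf(t + alpha) - cdf(t)


def smoothed_interval_prob(atoms, t: float, alpha: float, sigma: float) -> float:
    """Pr[X + Z in [t, t + alpha)] for X uniform over `atoms`, independent of Z.

    Conditioning on X = x turns the event into Z in [t - x, t - x + alpha).
    """
    return sum(gaussian_interval_prob(t - x, alpha, sigma) for x in atoms) / len(atoms)


def fact_4_5_bound(alpha: float, sigma: float) -> float:
    """C * alpha / sigma with C = 1 / sqrt(2 * pi), as in the proof."""
    return alpha / (sigma * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    atoms = [-1.0, 0.0, 0.0, 2.5]  # illustrative distribution with a repeated atom
    for t in (-1.0, -0.05, 0.0, 2.5):
        print(t, smoothed_interval_prob(atoms, t, 0.1, 0.5), fact_4_5_bound(0.1, 0.5))
```

Even when $X$ is a point mass, the probability never exceeds $\alpha/(\sigma\sqrt{2\pi})$, which is the sense in which the perturbation gives anti-concentration regardless of $X$.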

Proof of Theorem 4.1. The proof follows the same approach as that of Theorem 3.1. Without loss
of generality, we can suppose that $\mathcal{D}$ is in isotropic position, as functions of halfspaces are closed
under affine transformations.

Let the Gaussian perturbation be $Z \leftarrow \mathcal{N}(0,\Sigma)^m$, where $\Sigma \succeq \sigma I_m$. We next renormalize the
distribution $\mathcal{D}$ so that $\mathcal{D}(\sigma)$ is in isotropic position. Note that $\mathcal{D}(\sigma)$ is also sub-exponential. This
follows from a simple union bound. For any direction $w \in \mathbb{R}^n$, $\|w\| = 1$, and $X \leftarrow \mathcal{D}$ and
$Z \leftarrow \mathcal{N}(0,\Sigma)^m$,

$$\Pr[\,|\langle X + Z, w\rangle| > t\,] \le \Pr[\,|\langle X, w\rangle| > t/2\,] + \Pr[\,|\langle Z, w\rangle| > t/2\,] = O(\exp(-\Omega(t))),$$

where the last step follows from the fact that $X$ is sub-exponential by definition and that the
univariate Gaussian distribution is sub-exponential.

Fix halfspaces $h_i : \mathbb{R}^n \to \{1,-1\}$ and let random variables $X \leftarrow \mathcal{D}(\sigma)$ and $X' \leftarrow \mathcal{D}'$,
where $\mathcal{D}'$ $k$-moment matches $\mathcal{D}(\sigma)$ for $k$ to be chosen later. Let $Y, Y'$ be as in Theorem 3.1. Then, by
Fact 4.4, for any $w \in \mathbb{R}^n$, $\|w\| = 1$, $E[|\langle w, X\rangle|^j] \le C(j/\alpha)^j$.

Observe that the proofs of Equations (3.4) and (3.5) in Theorem 3.1 only used moment bounds
for log-concave distributions, and sub-exponential distributions have similar bounds on moments.
Thus, by similar arguments, for $k = 2^{O(m/\delta^4)}$ sufficiently large, we get

$$\mu_j(X) \le C(jm/\alpha)^j, \qquad \beta_k(Y) = \Omega((\log k)/m), \qquad (4.1)$$

and for $N = O(\log(m/\delta)/\alpha)$ sufficiently large,

$$\Pr[Y \notin [-N,N]^m] + \Pr[Y' \notin [-N,N]^m] \le 2\delta.$$

Combining the above two equations and Lemma 3.6, we have

$$d_\lambda(Y, Y') = (\log(\log(m/\delta)))^{O(m)} \cdot \delta. \qquad (4.2)$$

Now, note that for any $r \in [m]$, $Y_r$ can be written as $Y'_r + Z_r$, where $Z_r \leftarrow \mathcal{N}(0,\sigma)$ is independent
of $Y'_r$. Therefore, by Fact 4.5, $\Pr[Y_r \in [t, t+\gamma]] = O(\gamma/\sigma)$ for $t \in \mathbb{R}$, $\gamma > 0$. Thus, by the above
equation and Fact 2.5,

$$d_{\mathrm{cdf}}(Y,Y') = O(m\, d_\lambda(Y,Y')/\sigma) = (\log(\log(m/\delta)))^{O(m)} \cdot \delta/\sigma = \varepsilon.$$

The theorem now follows from an argument similar to that of Theorem 3.1 following Equation 3.7. 

□ 

4.2 Sub-Gaussian Densities

We now study sub-Gaussian densities and show an analogue of Theorem 4.1 with much better
parameters. The improvement in parameters comes from the fact that sub-Gaussian distributions
have much more tightly controlled moments.

Definition 4.6. We say an isotropic distribution $\mathcal{D}$ is sub-Gaussian if there exist constants $C, \alpha >
0$ such that for every $w \in \mathbb{R}^n$, $\|w\| = 1$, and $t \in \mathbb{R}$,

$$\Pr_{X \leftarrow \mathcal{D}}[\,|\langle w, X\rangle| > t\,] \le C \exp(-\alpha^2 t^2).$$

More generally, we say a distribution $\mathcal{D}$ on $\mathbb{R}^n$ is sub-Gaussian if the isotropic distribution
obtained by putting $\mathcal{D}$ in an isotropic position by an affine transformation is sub-Gaussian.

Analogous to Fact 4.4, we have the following statement for sub-Gaussian densities.

Fact 4.7. Let $X$ be a unit variance random variable such that $\Pr[|X| > t] \le C \exp(-\alpha^2 t^2)$. Then,
$E[|X|^k] \le C(k/\alpha^2)^{k/2}$.
Proof of Theorem 4.2. The proof follows the same approach as that of Theorem 4.1. We only highlight
the important differences. Fix halfspaces $h_i : \mathbb{R}^n \to \{1,-1\}$, and random variables $X, X', Y, Y'$
as in the proof of Theorem 4.1. Now, observe that for $k = \Omega(m/\delta^4)$, sufficiently large, for any
$t \in \mathbb{R}^m$, $\|t\| = 1$, and $j > 0$,

$$E[|\langle t, Y\rangle|^j] = E\left[\left|\left\langle \sum_{r=1}^m t_r w_r, X\right\rangle\right|^j\right]$$

$$\le C j^{j/2} \cdot E\left[\left\langle \sum_{r=1}^m t_r w_r, X\right\rangle^2\right]^{j/2} / \alpha^j \qquad \text{(Fact 4.7)}$$

$$= O((m/\alpha)^j \cdot j^{j/2}).$$



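Therefore,

$$\beta_k(Y) = \sum_{j=1}^{k} \frac{1}{\mu_{2j}(Y)^{1/2j}} \ge \sum_{j=1}^{k} \Omega\left(\frac{1}{m\sqrt{j}}\right) = \Omega(\sqrt{k}/m). \qquad (4.3)$$

Note that the above bound on $\beta_k(Y)$ is exponentially better than the $\Omega(\log k)$ bound we had for
log-concave and sub-exponential densities and this leads to the quantitative improvements for
sub-Gaussian densities.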
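To make the "exponentially better" comparison concrete, the sketch below contrasts the two growth rates numerically. It assumes, as an interpretation rather than a statement from the paper, that the quantity bounded in (4.1) and (4.3) is a Carleman-type sum of reciprocals of $2j$-th roots of the $2j$-th moments, and that those roots grow like $2j$ in the sub-exponential case (Fact 4.4) and like $\sqrt{2j}$ in the sub-Gaussian case (Fact 4.7). Constants and the factor $m$ are dropped.

```python
import math


def reciprocal_root_sum(k: int, root_growth) -> float:
    """Sum over j = 1..k of 1 / root_growth(j).

    root_growth(j) stands in for the 2j-th root of the 2j-th moment.
    """
    return sum(1.0 / root_growth(j) for j in range(1, k + 1))


def sub_exponential_root(j: int) -> float:
    """Assumed growth (2j) implied by a moment bound of the form (2j)^(2j)."""
    return 2.0 * j


def sub_gaussian_root(j: int) -> float:
    """Assumed growth sqrt(2j) implied by a moment bound of the form (2j)^j."""
    return math.sqrt(2.0 * j)


if __name__ == "__main__":
    for k in (10, 100, 1000, 10000):
        print(
            k,
            reciprocal_root_sum(k, sub_exponential_root),  # grows like log k
            reciprocal_root_sum(k, sub_gaussian_root),     # grows like sqrt(k)
        )
```

Reaching a target value $B$ for the sum therefore needs $k$ exponential in $B$ in the first case but only polynomial in $B$ in the second, which matches the difference between the degree bounds of Theorems 4.1 and 4.2.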



Now, by using Markov's inequality it follows that for $k \ge \log(m/\delta)$, and $N = O(\sqrt{\log(m/\delta)}/\alpha)$
sufficiently large,

$$\Pr[Y \notin [-N,N]^m] + \Pr[Y' \notin [-N,N]^m] \le 2\delta.$$

Combining the above two equations and Lemma 3.6, we get

$$d_\lambda(Y,Y') = (\log(\log(m/\delta)))^{O(m)} \cdot \delta.$$

The theorem now follows from the above inequality and an argument similar to that of
Theorem 4.1 following Equation (4.2). □



4.3 Non-Product Distributions on Hypercube 

Learning intersections of halfspaces with respect to distributions on the hypercube is a long-standing
and fundamental open problem in learning theory. To date, most non-trivial results pertain
to product distributions on the hypercube, with the exception of the work of Wimmer [Wim],
who can handle symmetric distributions on the hypercube.

Our results imply algorithms for agnostically learning functions of halfspaces in the smoothed
complexity setting for distributions on the hypercube that are locally independent. Specifically,
call a distribution $\mathcal{D}$ on $\{1,-1\}^n$ $k$-wise independent if for any $I \subseteq [n]$, $|I| \le k$, $X \leftarrow \mathcal{D}$, the
variables $(X_i : i \in I)$ are independent. (This is the same as saying $\mathcal{D}$ $k$-moment matches the uniform
distribution on $\{1,-1\}^n$.) Our learning algorithms for sub-Gaussian densities, Theorem 4.2,
immediately imply the following for learning with respect to $k$-wise independent distributions.
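A small worked example may help fix the definition. The sketch below uses a standard construction that is not taken from the paper: the distribution on $\{1,-1\}^3$ that picks $a, b$ uniformly and outputs $(a, b, ab)$. It is 2-wise independent with uniform marginals but not 3-wise independent, and the code confirms that it matches the uniform distribution on all moments of degree at most 2 but not on degree 3. Since $x_i^2 = 1$ on the hypercube, it suffices to check multilinear monomials, all of which have mean 0 under the uniform distribution.

```python
from itertools import combinations, product
from math import prod


def monomial_mean(support, idx) -> float:
    """Mean of prod_{i in idx} x_i when x is uniform over the points in `support`."""
    return sum(prod(x[i] for i in idx) for x in support) / len(support)


def matches_uniform_moments(support, n: int, k: int) -> bool:
    """True iff every nonempty multilinear monomial of degree <= k has mean 0,

    which is its mean under the uniform distribution on {1,-1}^n.
    """
    return all(
        monomial_mean(support, idx) == 0
        for degree in range(1, k + 1)
        for idx in combinations(range(n), degree)
    )


# Illustrative pairwise-independent distribution: (a, b, a*b) for uniform a, b.
PAIRWISE = [(a, b, a * b) for a, b in product((1, -1), repeat=2)]

if __name__ == "__main__":
    print(matches_uniform_moments(PAIRWISE, 3, 2))  # degree <= 2 moments
    print(matches_uniform_moments(PAIRWISE, 3, 3))  # degree 3 moment differs
```

The support has only 4 points instead of 8, yet no statistic of degree at most 2 can tell it apart from the uniform distribution.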

Theorem 4.8. For all $m, \varepsilon, \sigma$ there exists $k = O_{m,\varepsilon,\sigma}(1)$ such that the following holds. Functions
of $m$ halfspaces are agnostically learnable with respect to any $k$-wise independent distribution on
$\{1,-1\}^n$ in time $n^{O_{m,\varepsilon,\sigma}(1)}$, where $\varepsilon$ is the accuracy and $\sigma$ is the perturbation parameter.

In contrast, it is not clear if any of the previous techniques can give algorithms for learning
intersections of halfspaces with respect to distributions that are even $\Omega(n)$-wise independent.

Proof. The uniform distribution on $\{1,-1\}^n$ is known to be sub-Gaussian [Pin]. Further, observe
that in the proof of Theorem 4.2 we only used properties of the first $k$ moments for $k =
(\log((\log m)/\sigma\varepsilon))^{O(m)}/(\sigma\varepsilon)^4$. Thus, the same arguments work for any distribution $\mathcal{D}$
which is $k$-wise independent. The theorem then follows from combining the direct analogue of
Theorem 4.2 for $k$-wise independent distributions $\mathcal{D}$ with Theorem 2.2. □



5 Bounded Independence Fools Degree Two Threshold Functions 

Here we show that the methods of Section 3 can also be used with respect to the uniform distribu- 
tion over {1, — l} n . We use the moment-matching techniques to give a new proof for the recent 
result of Diakonikolas, Kane, and Nelson [DKN] that bounded independence fools degree-2 poly- 
nomial threshold functions. Our proof gives worse parameters, but is considerably different and is 
perhaps simpler. We also establish a connection between the pseudorandomness problem and the 
well studied classical moment problem in probability (see [Akh] for instance). 

Theorem 5.1. There exist constants $C, C'$ such that the following holds. Let $\mathcal{D}$ be an $m$-wise
independent distribution over $\{1,-1\}^n$ for $m = 2^{C/\delta^9}$. Then, for every degree 2 polynomial
$P : \mathbb{R}^n \to \mathbb{R}$, and $x \leftarrow \mathcal{D}$, $y \in_u \{1,-1\}^n$, $d_{\mathrm{cdf}}(P(x), P(y)) \le C'\delta$. In other words, $(2^{O(1/\delta^9)})$-wise
independence $\delta$-fools degree two threshold functions.

In comparison, Diakonikolas et al. show that $O(\delta^{-9})$-wise independence suffices. This bound
was later improved to $O(\delta^{-8})$ in [Kan2].

We shall use the following quantitative estimate due to Klebanov and Mkrtchyan which can 
be seen as a one dimensional version of Theorem 3.4, albeit with better parameters. 

Theorem 5.2 (Theorem 1, [KM1]). Let $X, Y$ be real-valued random variables with $E[X^i] = E[Y^i]$
for $1 \le i \le 2m$ and $E[X^2] = 1$. Then, for a universal constant $C > 0$,

dLv(x ' y) - ]uxW • 

We only detail the case of regular polynomials here; the reduction from the general case to
the regular case works via the regularity lemma of Harsha et al. [HKM] and Diakonikolas et al.
[DSTW].

Definition 5.3. A multi-linear polynomial $P : \mathbb{R}^n \to \mathbb{R}$, $P(x) = \sum_{I \subseteq [n]} a_I \prod_{i \in I} x_i$, is $\delta$-regular
if for every $i \in [n]$,

i=1 \IC[n],l3i J 

where $\|P\|_2^2 = \sum_I a_I^2$.

Theorem 5.4. There exist constants $C, C'$ such that the following holds. Let $\mathcal{D}$ be an $m$-wise
independent distribution over $\{1,-1\}^n$ for $m = 2^{C/\delta^2}$. Then, for every $\delta$-regular degree 2 polynomial
$P : \mathbb{R}^n \to \mathbb{R}$, and $x \leftarrow \mathcal{D}$, $y \in_u \{1,-1\}^n$, $d_{\mathrm{cdf}}(P(x), P(y)) \le C'\delta^{2/9}$.

Theorem 5.1 follows from the above theorem and the regularity lemma of Harsha et al. and
Diakonikolas et al. We refer the reader to the work of Meka and Zuckerman [MZ] for a similar
reduction of the general case to the regular case in the pseudorandomness context and omit it here.

To prove Theorem 5.4 we use the following results about low-degree polynomials. The first
gives us control on how fast the moments of low-degree polynomials grow.

Theorem 5.5 (Hypercontractivity, [LT]). For $1 < p < q < \infty$, and $P : \mathbb{R}^n \to \mathbb{R}$ a degree $d$
polynomial, the following holds:

$$E[|P(X)|^q]^{1/q} \le \left(\frac{q-1}{p-1}\right)^{d/2} E[|P(X)|^p]^{1/p}. \qquad (5.1)$$
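Inequality (5.1) can be verified by brute force on a small instance. The sketch below assumes $X$ is uniform on $\{1,-1\}^n$ (the setting in which the theorem is used in this section), fixes $p = 2$, and uses an arbitrary degree 2 polynomial on 4 variables whose coefficients are purely illustrative.

```python
from itertools import product


def poly(x) -> float:
    """An illustrative degree-2 multilinear polynomial on 4 variables."""
    return x[0] * x[1] + 2.0 * x[1] * x[2] - x[2] * x[3] + 0.5 * x[0] + 1.0


def norm(f, n: int, q: float) -> float:
    """(E|f(X)|^q)^(1/q) for X uniform on {1,-1}^n, by exhaustive enumeration."""
    points = list(product((1, -1), repeat=n))
    return (sum(abs(f(x)) ** q for x in points) / len(points)) ** (1.0 / q)


def hypercontractive_bound(f, n: int, p: float, q: float, d: int) -> float:
    """Right-hand side of (5.1): ((q - 1) / (p - 1))^(d / 2) * ||f||_p."""
    return ((q - 1.0) / (p - 1.0)) ** (d / 2.0) * norm(f, n, p)


if __name__ == "__main__":
    for q in (3, 4, 6, 8):
        print(q, norm(poly, 4, q), hypercontractive_bound(poly, 4, 2, q, 2))
```

With $p = 2$ and $d = 2$ the bound reads $\|P\|_q \le (q-1)\|P\|_2$, so raising both sides to the power $q = 2i$ gives moment growth of order $(2i)^{2i}$ when $\|P\|_2 = 1$.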



The next two theorems help us get anti-concentration bounds for regular polynomials over
the hypercube.

Theorem 5.6 (Mossel et al. [MOO]). There exists a universal constant $C$ such that the following
holds. Let $P : \mathbb{R}^n \to \mathbb{R}$ be a degree $d$ $\delta$-regular (multi-linear) polynomial. Then, for $x \in_u
\{1,-1\}^n$ and $y \leftarrow \mathcal{N}(0,1)^n$,

$$d_{\mathrm{cdf}}(P(x), P(y)) \le C d\, \delta^{2/(4d+1)}.$$

Theorem 5.7 (Carbery and Wright [CW]). There exists an absolute constant $C$ such that for
any polynomial $Q$ of degree at most $d$ with $\|Q\| = 1$ and any interval $I \subseteq \mathbb{R}$ of length $\alpha$,
$\Pr_{X \leftarrow \mathcal{N}(0,1)^n}[Q(X) \in I] \le C d\, \alpha^{1/d}$.

Proof of Theorem 5.4. It suffices to show the statement when $\mathcal{D}$ is $4m$-wise independent for $m = 2^{C/\delta^2}$,
for $C$ to be chosen later. Without loss of generality suppose that $\|P\| = 1$. Let random variables
$X = P(x)$, for $x \leftarrow \mathcal{D}$, and $Y = P(y)$, for $y \in_u \{1,-1\}^n$. Then, $E[X^i] = E[Y^i]$ for $i \le 2m$ as $x$
is $4m$-wise independent and $P$ is a degree 2 polynomial. Now, for $i \le m$, by hypercontractivity,
Theorem 5.5, applied to $q = 2i$, $d = 2$,

$$E[X^{2i}] = E[Y^{2i}] \le (2i)^{2i}.$$

Therefore,

$$\sum_{i=1}^{m} \frac{1}{E[X^{2i}]^{1/2i}} \ge \sum_{i=1}^{m} \frac{1}{2i} = \Omega(\log m).$$

By Theorem 5.2,

$$d_{\mathrm{LV}}(X,Y) = O\left(\frac{\log\log m}{(\log m)^{1/4}}\right).$$

Now, by Theorem 5.6 and Theorem 5.7 applied to degree $d = 2$, $\sup_t \Pr[Y \in [t, t+\alpha]] =
O(\delta^{2/9} + \sqrt{\alpha})$. Therefore, by Fact 2.5,

$$d_{\mathrm{cdf}}(X,Y) = O\left(\delta^{2/9} + \sqrt{\frac{\log\log m}{(\log m)^{1/4}}}\right).$$

The statement now follows by choosing $C$ to be sufficiently large. □